Abstract

We investigate dynamical models of human motion that can support both synthesis and analysis tasks. Unlike coarser discriminative models that work well when action classes are nicely separated, we seek models that have fine-scale representational power and can therefore model subtle differences in the way an action is performed. To this end, we model an observed action as an (unknown) linear time-invariant dynamical model of relatively small order, driven by a sparse bounded input signal.

Our motivating intuition is that the time-invariant dynamics will capture the unchanging physical characteristics of an actor, while the inputs used to excite the system will correspond to a causal signature of the action being performed. We show that our model has sufficient representational power to closely approximate large classes of non-stationary actions with significantly reduced complexity. We also show that temporal statistics of the inferred input sequences can be compared in order to recognize actions and detect transitions between them.

1.  Introduction 

Analysis and synthesis of human motion are of paramount importance in human-machine interfaces, rehabilitation, security, and entertainment, just to mention a few applications. While pictorial cues convey a significant amount of information on the underlying processes, we focus on the information encoded in the temporal evolution of the data. We therefore assume that a multivariate time series has been abstracted from a person's motion, and focus on identifying models of its temporal statistics. Such a representation could be obtained from video (string of pixel intensities, orientation histograms (HOGs), joint angles of a skeletal model, etc.) or from sensors worn by an individual. While the data extraction is by no means trivial, we focus on the second problem of learning dynamical models from the time series.

For some tasks, such as classification of distinctive motions, purely discriminative models are sufficient [9]. Some benchmark datasets can even be classified reliably taking into account as little information as local shape and optical flow in a single frame [23]. However, in situations where the temporal order of motions is significant (or perhaps the only discriminative information), or to address more subtle queries such as long-term or fine-scale prediction, models with generative capability and greater representational accuracy are useful. Since the discrete multinomial state of generative models such as Hidden Markov Models (HMMs) [31, 29, 12] requires exponentially more parameters as more signal history is encoded, we favor dynamic models with continuous latent variables to pursue the desired level of detail in action representation.

We propose to view human motion analysis as a blind system identification problem where each limb is an unknown linear dynamical system (LDS) driven by an unknown input. Intuitively, the dynamical model represents physical characteristics of an actor, such as mass and inertia, whereas the input represents the driving signal, a signature of the action. Without additional constraints this is an ill-posed problem. Traditionally, assumptions have been made that the driving input is a process with samples from some canonical distribution (typically a Gaussian). These approaches were successful at capturing observations with second-order stationary statistics, and therefore worked well for modeling quasi-repetitive actions such as walking and running [3]. However, the limitations of these models quickly become apparent when one considers more complex non-stationary sequences, e.g. Fig. 1. Our goal in this work is to be able to capture such non-stationarities in human action sequences and to reliably identify when changes between distinct actions occur.

To render the blind identification problem above well-posed we constrain the dynamics to be linear and time-invariant (our body masses do not change at the time-scale of observation), transferring all the non-stationary characteristics of the observed time series to the input. Ideally, we want a class of inputs that would serve as a signature for actions. One logical option is to assume that the input should be mostly zero, except when soliciting a change of elementary movement. When non-zero, it should have bounded energy, to avoid embarrassing violations of elementary physics laws. This translates into the problem of performing blind identification/deconvolution under bounded energy and sparsity constraints on the input. Exploiting recent results from convex optimization and sparse representations, in Sect. 2 through Sect. 4 we develop a new algorithm for the task.

We validate our model by demonstrating the ability to accurately capture more complex actions than previous linear dynamical system approaches in Sect. 5. The compressive power of our sparse representation is also addressed. In Sect. 6, we present the application of customizable synthesis by modification of model parameters and show that our sparse representation supports segmentation and classification tasks. Through the segmentation and classification tasks, we validate our hypothesis that the input encodes signatures of individual actions.

1.1.  Related  Work 

Understanding human actions is a critical problem that has received considerable attention in the machine learning and vision communities. Our model falls into the class of linear dynamical systems, where the task of motion modeling has been posed as a system identification problem [4, 20]. Up until now the LDS literature in human motion has assumed a stochastic input with a known distribution, which limits the representational capability to simpler regular actions. This motivates the use of switched-linear dynamical systems (SLDS), in which changes of the model parameters enhance the ability of the model to capture more complex motions [19, 15]. In [18], an SLDS approach was proposed where only the zeros of the transfer function were allowed to change across actions and an HMM was used to drive these changes. Works with a similar spirit have used switched autoregressive (SAR) systems to model videos. Video segmentation is achieved by detecting changes of the coefficients of the AR model. The identification of SAR has been addressed as a convex optimization problem by [17], and as identification of homogeneous polynomials by [27].

Yet another perspective on capturing the non-stationarity of human actions is offered by Gaussian processes [28]. These models learn a nonlinear mapping from the observation space into a latent space and a nonlinear system in the latent space. A downside of this approach is that it does not provide information which can directly be used for classification or segmentation of the modeled motion. Physically based nonlinear temporal models have also been used to synthesize human motion [10, 11]. However, the process of concatenating "basic" controllers becomes too complex for most actions of interest.

In our case we assume a single linear time-invariant model. We show that by changing the assumptions on the input we increase the ability of LDS to capture complex actions and simultaneously capture useful action characteristics in the input.

Like  most  of  the  literature  above  we  focus  on  designing 
dynamical  models  for  actions,  with  no  regard  to  how  the 
time  series  is  extracted.  In  our  experiments,  we  use  motion 
capture  data  to  evaluate  our  approach. 

2.  The  Underlying  Dynamical  System 

Data observed from human actions, whether from video or motion capture, can be viewed as a multivariate time series. The core hypothesis of this work is that such multivariate time series $y(t) \in \mathbb{R}^p$ are outputs of a linear time-invariant dynamical system driven by a one-dimensional sparse and bounded input, $u(t) \in \mathbb{R}$. The dynamical system is defined by its system matrices $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times 1}$, $C \in \mathbb{R}^{p \times n}$, and a state vector $x(t) \in \mathbb{R}^n$. Above, $n$ is the order of the LDS and $p$ is the dimension of the observation. This model can be expressed as follows:

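$$
\begin{cases}
\|u\|_{\ell_0} \le k \\
|u(t)| \le 1 \quad \forall t \\
\sum_{i=0}^{N-2} \|C A^i B\|_1 \le \mu \\
x(t+1) = A x(t) + B u(t) \\
y(t) = C x(t),
\end{cases} \tag{1}
$$

where $\|u\|_{\ell_0}$ is the number of nonzero elements in the input sequence $u$. We can write (1) as:

$$
\begin{cases}
\|U\|_{\ell_0} \le k \\
|u_i| \le 1 \quad \forall i \\
\sum_{i=0}^{N-2} \|C A^i B\|_1 \le \mu \\
Y = \Gamma X_0 + H U,
\end{cases} \tag{2}
$$

where $i$ is the discrete index, $N$ is the length of the signal, $Y = [y_0^T, y_1^T, \ldots, y_{N-1}^T]^T$, $U = [u_0, u_1, \ldots, u_{N-1}]^T$, $X_0 \in \mathbb{R}^n$ is the initial condition of the system, and $\Gamma$ and $H$ are, respectively, the extended observability matrix and the block-Toeplitz matrix of impulse-response (Markov) parameters $C A^i B$ of the system.
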
Our representation consists of a total of 7 systems in the form of (2) that model body pose along with global position and orientation. Pose systems are learned from Euler angles grouped into 5 multidimensional time series for the body (torso, 2 arms, and 2 legs). The other 2 systems are learned from absolute 3D positions and orientation angles respectively.


The intuition behind the sparsity constraint on the input is to limit our solution space in such a way as to force as much of the stationary dynamics as possible into the system parameters. This way we hope to view our input as a triggering mechanism, a spike train sequence, that is a characteristic signature of the action. As shown in Fig. 1, the inputs found by our method given these constraints typically consist of sequences of impulses. It can thus be said that our representation interprets actions as a superposition of impulse responses.
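The superposition view can be checked numerically. The following is a minimal sketch, assuming a NumPy setting with $B$ stored as a length-$n$ vector; the function names (`simulate_lds`, `impulse_response`, `superpose`) are hypothetical and not taken from the paper. It simulates model (1) directly and then rebuilds the same output as the free response plus one shifted, scaled impulse response per nonzero input sample.

```python
import numpy as np

def simulate_lds(A, B, C, x0, u):
    """Simulate x(t+1) = A x(t) + B u(t), y(t) = C x(t) for t = 0..N-1."""
    x = x0.copy()
    ys = []
    for ut in u:
        ys.append(C @ x)
        x = A @ x + B * ut
    return np.array(ys)                      # shape (N, p)

def impulse_response(A, B, C, N):
    """Markov parameters h_k = C A^k B for k = 0..N-2, shape (N-1, p)."""
    h, v = [], B.copy()
    for _ in range(N - 1):
        h.append(C @ v)
        v = A @ v
    return np.array(h)

def superpose(A, B, C, x0, u):
    """Free response plus one shifted impulse response per nonzero input."""
    N = len(u)
    h = impulse_response(A, B, C, N)
    y = np.zeros((N, C.shape[0]))
    x = x0.copy()
    for t in range(N):                       # free response C A^t x0
        y[t] = C @ x
        x = A @ x
    for s in np.flatnonzero(u):              # a spike at time s acts from s+1 on
        y[s + 1:] += u[s] * h[: N - 1 - s]
    return y
```

Both routines compute $Y = \Gamma X_0 + HU$; the second makes explicit that only the nonzero entries of $U$ contribute a term.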

One more aspect of our model is bounding the input. This forces variables such as the amplitude of an action into the system matrices, thus resulting in inputs that are more comparable across actions and individuals. Moreover, the unit impulse response of the system is bounded to prevent degenerate solutions due to scale ambiguity: $\|H\|_1 \to \infty$ and $\|u\|_1 \to 0$.

3.  Identification  with  Sparse  Bounded  Input 

Instead  of  seeking  to  minimize  the  one-step  prediction 
error,  as  in  HMMs  and  autoregressive  models,  we  focus  on 
the  full  simulation  error: 

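$$
\begin{aligned}
\underset{U, X_0, A, B, C}{\text{minimize}} \quad & \|Y - \Gamma X_0 - HU\|_2^2 \\
\text{subject to:} \quad & \|U\|_{\ell_0} \le k \\
& |u_i| \le 1, \quad i = 0, \ldots, N-1 \\
& \sum_{i=0}^{N-2} \|C A^i B\|_1 \le \mu
\end{aligned} \tag{3}
$$

where $Y$ is the observed time series. It is well known that the minimization in (3) is NP-hard, thus we relax the problem to a weighted $\ell_1$ minimization [25]:

$$
\begin{aligned}
\underset{U, X_0, A, B, C}{\text{minimize}} \quad & \|Y - \Gamma X_0 - HU\|_2^2 + \lambda \sum_{i=0}^{N-1} w_i |u_i| \\
\text{subject to:} \quad & |u_i| \le 1, \quad i = 0, \ldots, N-1 \\
& \sum_{i=0}^{N-2} \|C A^i B\|_1 \le \mu.
\end{aligned} \tag{4}
$$

The form above adds a regularizer term, with $\lambda$ serving as the tradeoff between accuracy of fit and sparsity.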

3.1.  Alternating  Minimization 

Our approach to solving (4) is similar in spirit to algorithms for learning dictionaries to sparsely represent images [16]. In our case, however, the dictionary is the impulse response of the linear dynamical system, $H$.


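Algorithm:

1. Select the order of the system$^1$: $n$

2. Initialize a random sparse input $U$ satisfying the constraint: $|u_i| \le 1, \ \forall i$

3. Repeat

(a) Given $U$:

Identify an LDS: $A$, $B$, $C$, $X_0$
Scale $B = \min\left(1, \frac{\mu}{\sum_{i} \|C A^i B\|_1}\right) \cdot B$

(b) Given $X_0$, $\Gamma$ and $H$:

$$
\begin{aligned}
\underset{U}{\text{minimize}} \quad & \|Y - \Gamma X_0 - HU\|_2^2 + \lambda \sum_{i=0}^{N-1} w_i |u_i| \\
\text{subject to:} \quad & |u_i| \le 1, \quad i = 0, \ldots, N-1
\end{aligned} \tag{5}
$$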


For estimating the $A$ and $C$ matrices of the LDS we use the subspace identification algorithm for deterministic systems [26] with the constraint that $A$ must be stable. For this purpose we adopt the method of [24], which incrementally adds constraints to a quadratic program to improve the stability of the estimated system matrix. Having estimated $A$ and $C$, the estimation of $B$ and $X_0$ is the least-squares solution of the simulation error [7].
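Once $A$, $C$, and $U$ are fixed, the simulated output is linear in $(X_0, B)$, which is why this step reduces to ordinary least squares. The sketch below is one way to set that up, under the assumption that $B$ is a length-$n$ vector and $Y$ is stored as an $N \times p$ array; the names `estimate_B_x0`, `impulse_l1`, and `scale_B` are hypothetical, and the subspace and stability steps of [26, 24] are not reproduced here. The last function illustrates the rescaling of $B$ in step 3(a).

```python
import numpy as np

def estimate_B_x0(A, C, u, Y):
    """Least-squares B and x0 from y(t) = C A^t x0 + sum_{s<t} u(s) C A^(t-1-s) B."""
    n = A.shape[0]
    P = np.eye(n)                  # A^t
    Q = np.zeros((n, n))           # sum_{s<t} u(s) A^(t-1-s)
    rows = []
    for ut in u:
        rows.append(np.hstack([C @ P, C @ Q]))
        P = A @ P
        Q = A @ Q + ut * np.eye(n)
    Phi = np.vstack(rows)          # regression matrix, shape (N*p, 2n)
    theta, *_ = np.linalg.lstsq(Phi, Y.reshape(-1), rcond=None)
    return theta[n:], theta[:n]    # B, x0

def impulse_l1(A, B, C, N):
    """sum_{i=0}^{N-2} ||C A^i B||_1, the quantity bounded by mu."""
    total, v = 0.0, B.copy()
    for _ in range(N - 1):
        total += np.abs(C @ v).sum()
        v = A @ v
    return total

def scale_B(A, B, C, N, mu):
    """Shrink B only if the impulse-response bound is violated."""
    return min(1.0, mu / impulse_l1(A, B, C, N)) * B
```

Because the bound is linear in $B$, a single multiplicative rescaling is enough to restore feasibility.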

3.2.  Enhancing  Sparsity 

The sparsity of the result obtained by solving a uniformly weighted $\ell_1$-regularized least-squares formulation (5) can be further enhanced by incorporating an iterative reweighting scheme [6]. Step 3(b) of the algorithm above is thus modified as follows:

1. Initialize the weights: $w_i^{(0)} = 1$, $i = 0, \ldots, N-1$.

2. Solve the weighted minimization problem

$$
\begin{aligned}
U^{(l)} = \arg\min \quad & \|Y - \Gamma X_0 - HU\|_2^2 + \lambda \sum_{i=0}^{N-1} w_i^{(l)} |u_i| \\
\text{subject to:} \quad & |u_i| \le 1, \quad i = 0, \ldots, N-1
\end{aligned}
$$

3. Update the weights: $w_i^{(l+1)} = \frac{1}{|u_i^{(l)}| + \epsilon}$.

4. Terminate on convergence or when $l = l_{\text{maxiter}}$.
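The reweighting loop can be sketched in a few lines. Note the assumptions: the inner solver below is a plain proximal-gradient (ISTA) iteration with clipping, swapped in for brevity in place of the interior-point method of Sect. 4; `y` stands for $Y - \Gamma X_0$; and the function names and default parameter values are hypothetical.

```python
import numpy as np

def solve_weighted_l1(H, y, lam, w, u0, n_iter=500):
    """Proximal gradient for ||y - H u||_2^2 + lam * sum_i w_i |u_i|, |u_i| <= 1."""
    step = 1.0 / (2.0 * np.linalg.norm(H, 2) ** 2)   # 1 / Lipschitz constant
    u = u0.copy()
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ u - y)
        v = u - step * grad
        # soft-threshold (weighted l1 term), then clip (box constraint)
        v = np.sign(v) * np.maximum(np.abs(v) - step * lam * w, 0.0)
        u = np.clip(v, -1.0, 1.0)
    return u

def reweighted_l1(H, y, lam=0.05, eps=0.005, max_outer=5):
    """Iteratively reweighted l1: w_i <- 1 / (|u_i| + eps) after each solve."""
    N = H.shape[1]
    w = np.ones(N)                 # step 1: uniform weights
    u = np.zeros(N)
    for _ in range(max_outer):     # step 4: fixed iteration budget
        u = solve_weighted_l1(H, y, lam, w, u)   # step 2
        w = 1.0 / (np.abs(u) + eps)              # step 3
    return u
```

Entries that are already near zero receive a large weight in the next pass and are pushed to exactly zero, while large entries are penalized less.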

$^1$ The order of the system is verified during system identification [26].


[Figure 1 plots; top panel title: "Estimated Input Signal"]

Figure 1. This figure summarizes how our model captures actions and compares the representational power of sparse input driven LDS with that of traditional stochastic LDS for non-stationary actions. The top plot illustrates the input to the inferred LDS that drives a person's right leg during a dance. The output of the right leg LDS corresponds to the 9 dimensions of the original joint angle time series. In the center plot we show that over the course of the dance we capture the joint angle of the right hip, one of the leg's dimensions, with a median error of 3.57 degrees and a mean absolute error of 4.61 degrees with a standard deviation of 3.98 degrees. The original signal is shown in blue and our corresponding synthesis is shown in red. In the bottom plot the synthesis result of an LDS driven by Gaussian noise is shown with a solid black line. The dashed red line shows the 5-step prediction of the same stochastic system, and the original signal appears in blue.


4. Large Scale $\ell_1$ Minimization

Estimating the input of a linear time-invariant (LTI) system, $U$, using $\ell_1$ regularization is computationally intensive and becomes a challenging problem when the length of our observation is several thousand samples. However, we can reduce the computational cost significantly by exploiting the Toeplitz structure of the problem. The multiplication of a Toeplitz matrix with a vector can be performed in $O(N \log N)$ instead of $O(N^2)$. In our experiments we use the truncated Newton interior-point method proposed by Kim et al. [13], modified according to the specific constraints of our formulation. In the situation where the output is multivariate, $H$ can be represented with $p$ Toeplitz matrices to maintain efficiency during multiplication.
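The $O(N \log N)$ product follows from the fact that multiplying by a lower-triangular Toeplitz matrix is a causal convolution, which the FFT evaluates after zero-padding. A minimal sketch for a single output channel, assuming the first column of $H$ is stored as a length-$N$ array `h` (the function names are hypothetical):

```python
import numpy as np

def toeplitz_lower_matvec(h, u):
    """Return H @ u where H[i, j] = h[i - j] for i >= j and 0 otherwise."""
    N = len(u)
    L = 2 * N                      # zero-pad so circular = linear convolution
    y = np.fft.irfft(np.fft.rfft(h, L) * np.fft.rfft(u, L), L)
    return y[:N]

def toeplitz_lower_rmatvec(h, r):
    """Return H.T @ r, i.e. a correlation, via a reversed convolution."""
    return toeplitz_lower_matvec(h, r[::-1])[::-1]
```

For a $p$-dimensional output, the same two routines would be applied once per channel, each with its own impulse response, and the transposed products summed.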

4.1.  Primal  and  Dual  Problem 

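In order to use the duality gap to establish convergence criteria for the minimization, we derive the dual problem here. Our initial problem is:

$$
\begin{aligned}
\underset{U}{\text{minimize}} \quad & \|HU - \tilde{y}\|_2^2 + \lambda \sum_{i=0}^{N-1} w_i |u_i| \\
\text{subject to:} \quad & |U| \preceq \mathbf{1}
\end{aligned} \tag{6}
$$

where $\tilde{y} = Y - \Gamma X_0$. We change variables $\tilde{u}_i = w_i u_i$ and introduce the diagonal matrix $D = \mathrm{diag}(w)$, to transform to an unweighted $\ell_1$ regularized problem, where $w = [w_0, \ldots, w_{N-1}]^T$. Afterward, we introduce a new variable $z \in \mathbb{R}^N$, a new equality constraint $z = HD^{-1}\tilde{u} - \tilde{y}$, and make the box constraints implicit [5].

$$
\begin{aligned}
\underset{-w \preceq \tilde{u} \preceq w, \; z}{\text{minimize}} \quad & z^T z + \lambda \sum_{i=0}^{N-1} |\tilde{u}_i| \\
\text{subject to:} \quad & z = HD^{-1}\tilde{u} - \tilde{y}.
\end{aligned} \tag{7}
$$

The dual function of (7) is:

$$
\begin{aligned}
g(\nu) &= \inf_{\tilde{u}, z} \left( z^T z + \lambda \|\tilde{u}\|_1 + \nu^T (HD^{-1}\tilde{u} - \tilde{y} - z) \right) \\
&= -\frac{\nu^T \nu}{4} - \nu^T \tilde{y} - w^T \left( (D^{-1}H^T\nu + \lambda \mathbf{1})^- + (D^{-1}H^T\nu - \lambda \mathbf{1})^+ \right)
\end{aligned}
$$

where $q_i^+ = \max(q_i, 0)$, $q_i^- = \max(-q_i, 0)$. Any dual feasible point $\nu$ gives a lower bound on the optimal value of the primal problem (7).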
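As an illustration of how the bound can serve as a stopping criterion, the sketch below evaluates the primal objective of (7), the dual function, and their difference. The choice $\nu = 2(HD^{-1}\tilde{u} - \tilde{y})$ is an assumption made here (it is the value of $\nu$ that minimizes the Lagrangian over $z$ at the current residual); the function names are hypothetical, and a dense `H` is used only for clarity.

```python
import numpy as np

def primal_objective(H, y_t, lam, w, u_t):
    """z^T z + lam * ||u_t||_1 with z = H D^{-1} u_t - y_t, D = diag(w)."""
    z = H @ (u_t / w) - y_t
    return z @ z + lam * np.abs(u_t).sum()

def dual_objective(H, y_t, lam, w, nu):
    """g(nu) = -nu^T nu / 4 - nu^T y_t - w^T((c + lam)^- + (c - lam)^+)."""
    c = (H.T @ nu) / w                     # D^{-1} H^T nu
    neg = np.maximum(-(c + lam), 0.0)      # (c + lam 1)^-
    pos = np.maximum(c - lam, 0.0)         # (c - lam 1)^+
    return -0.25 * (nu @ nu) - nu @ y_t - w @ (neg + pos)

def duality_gap(H, y_t, lam, w, u_t):
    """Primal minus dual value at nu = 2 (H D^{-1} u_t - y_t)."""
    nu = 2.0 * (H @ (u_t / w) - y_t)
    return primal_objective(H, y_t, lam, w, u_t) - dual_objective(H, y_t, lam, w, nu)
```

For any $\tilde{u}$ with $-w \preceq \tilde{u} \preceq w$ the gap is nonnegative, so iterating until it falls below a tolerance bounds the suboptimality of the current iterate.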

4.2.  Truncated  Newton  Interior-Point  Method 

The $\ell_1$ regularized least-squares problem (7) can be transformed to a convex quadratic problem, with linear inequality constraints.

$$
\begin{aligned}
\underset{\tilde{u}, v}{\text{minimize}} \quad & z^T z + \lambda \sum_{i=0}^{N-1} v_i \\
\text{subject to:} \quad & z = HD^{-1}\tilde{u} - \tilde{y}; \quad -w \preceq \tilde{u} \preceq w; \quad -v \preceq \tilde{u} \preceq v
\end{aligned} \tag{8}
$$

In this part we incorporate an interior-point method for solving our convex optimization. We first define the logarithmic barrier for the bound constraints in (8):

$$
\Phi(\tilde{u}, v) = -\sum_{i=0}^{N-1} \log(v_i + \tilde{u}_i) - \sum_{i=0}^{N-1} \log(v_i - \tilde{u}_i) - \sum_{i=0}^{N-1} \log(w_i + \tilde{u}_i) - \sum_{i=0}^{N-1} \log(w_i - \tilde{u}_i). \tag{9}
$$

The central path consists of the unique minimizer $(\tilde{u}^*(t), v^*(t))$ of the convex function $\phi_t$ as the parameter $t$ varies from 0 to $\infty$:

$$
\phi_t(\tilde{u}, v) = t \|HD^{-1}\tilde{u} - \tilde{y}\|_2^2 + t \lambda \sum_{i=0}^{N-1} v_i + \Phi(\tilde{u}, v). \tag{10}
$$

In order to minimize $\phi_t(\tilde{u}, v)$, the search direction is computed as an approximate solution to the Newton system, using Preconditioned Conjugate Gradient [13].
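To make the "approximate solution" concrete, here is a generic preconditioned conjugate gradient routine that stops after a fixed iteration budget or once the relative residual is small. It is a sketch under assumptions: the symmetric positive definite operator is supplied as a function `matvec` standing in for the Newton-system Hessian, and the preconditioner is left to the caller (the specific preconditioner of [13] is not reproduced here).

```python
import numpy as np

def pcg(matvec, b, precond, tol=1e-8, max_iter=50):
    """Approximately solve M x = b for symmetric positive definite M.

    matvec(v) returns M @ v; precond(r) applies an approximate inverse of M.
    """
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:   # residual small enough
            break
        Mp = matvec(p)
        alpha = rz / (p @ Mp)
        x = x + alpha * p
        r = r - alpha * Mp
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```

Since only products with the operator are needed, each iteration can use FFT-based Toeplitz multiplications and never has to form $H^T H$ explicitly. A small `max_iter` gives the truncated behavior described above.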

5.  Experimental  Evaluation 
5.1.  Datasets 

The FutureLight action dataset [22] is a collection of 5 actions, performed with significant intra- and inter-class variations: "Dance", "Jump", "Sit", "Run", and "Walk". The durations of captured actions vary from 100 to over 800 frames. We applied our learning algorithm to the full joint angle representations of all 158 samples in the dataset. In all cases we used models of order $n = 10$, with the exception of "Sit" actions which were estimated with order $n = 8$ due to the small number of available frames. We performed the deconvolution using the sparsity enhancing reweighting scheme with $\lambda = 10$ and $\epsilon = 0.005$. We use FutureLight in Sect. 5.2 and Sect. 6.3 to demonstrate accuracy and explore the supervised classification task.

To test our hypothesis about capturing action signatures in the inferred input we also obtained 6 long sequences from the CMU Motion Capture Database$^2$, in each of which a single actor performs several actions in succession. For instance, subject 86, sequence 3, contains smooth transitions between a number of sports related actions including walking, running, jumping, kicking, stretching, and even jump-kicking. We used the same parameter settings as in the FutureLight dataset ($n = 10$). In Sect. 6.2 we show that by taking simple statistics on the inferred input we were able to accurately classify the actions performed and localize their transitions (Fig. 2).

$^2$ We used ~5000 frames from subject 86, sequences 1, 2, 3, 5, 6, 7.

5.2.  Accuracy  and  Compression 

The minimum requirement for a model is that it captures the statistics of the data with smaller complexity than the data itself. We show that our model achieves this task by assessing the accuracy of our reconstruction and sparsity of the inferred input.

First, we show some qualitative results of a complex non-stationary dance sequence in Fig. 1. From Fig. 1, it is clear that our model captures motions accurately where typical Gaussian noise driven LDSs of similar complexity experience a significant lack of representational power. For a more extensive evaluation, in Table 2, we report the error in representing the position ($X$, $Y$, $Z$) and joint angles, expressed in Euler angles, for the 5 actions in the FutureLight dataset. Further, in Table 1, we compare the mean reconstruction error for these joint angles modeled with different approaches. Our model ($Y$ in Eq. (2)) achieves the smallest reconstruction error, illustrating that it captures the signal more accurately than Gaussian noise driven LDS, even when the latter systems are given the added benefit of using information from 5 time steps in the past.


Method                             Mean Absolute Error (°)
Our Model (Simulation)              4.96
Stoch. LDS (Simulation)            17.70
Stoch. LDS (5-step prediction)      5.29
Stoch. LDS (10-step prediction)     6.74

Table 1. Comparison with other methods.


In addition to modeling individual actions well, we observed that in the CMU sequences our model could capture successive actions and their transitions with a single dynamic system. Typically, such transitions between actions are difficult to capture and historically have even been treated as independent action classes. Results using our inference on these sequences are discussed in Sect. 6.2 and Fig. 2.

Even though synthesis is a valid method of evaluating what we capture, it is not the key goal of our model. Thus we do not focus on adding any kinematic or smoothness constraints, as is often done in graphics literature [30, 1] to generate lifelike motions.

Finally, we compute that on average, in FutureLight, 78.84% of the input signal values are zero, confirming that the inferred signal is sparse. An advantage that comes with using our sparse input LDS representation is the compressive quality of the models. For a leg, whose original $N$-length time series has 9 dimensions, this representation typically reduces to only 8.4% of the original size across all actions (not counting the constant overhead to store the dynamical system parameters).

                                        Dance          Jump           Sit            Run            Walk
                                        XYZ   Angles   XYZ   Angles   XYZ   Angles   XYZ   Angles   XYZ   Angles
Mean Absolute Error                     1.38  4.50     1.50  5.83     1.63  5.89     1.46  5.71     1.55  1.90
Standard Deviation of Absolute Error    1.05  3.69     1.13  5.01     1.21  4.84     1.31  4.75     1.74  2.38
Median of Absolute Error                1.16  3.66     1.28  4.64     1.37  4.72     1.15  4.61     1.08  2.38

Table 2. Representational power of our model as evaluated on the FutureLight dataset. The errors for angle measurements are in degrees. The 3D position errors are reported in the units of the motion capture data (inches scaled by a factor of 0.45).

6.  Applications 

Aside from providing a concise and accurate representation of complex actions, our model allows for a number of other applications. In this section, we outline possibilities of how our model can be leveraged, and explore our hypotheses regarding the information encoded by the inferred input.

6.1.  Creative  Synthesis  of  Actions 

Once a model is learned, the inputs and parameters of each dynamical system are directly available and can be controlled purposefully in ways similar to Doretto et al. for dynamic textures [8]. For example, we can change the intensity of the motion by scaling the $C$ matrix of the system. This type of creative editing of the dynamics and input can result in interesting variations on an original action. Examples are illustrated at http://vision.ucla.edu/~mraptis/spikes

6.2.  Unsupervised  Action  Segmentation 

Having observed that our model is strong enough to capture transitions between distinct actions, we can pose the inverse problem of detecting such transitions from data. For this we use the six long sequences from the CMU dataset. Since we constrain the dynamics to be constant throughout a single sequence, finding transitions corresponds to temporal segmentation of the input. We go one step beyond simple segmentation and identify repeated patterns in the input that correspond to distinct actions. With this experiment we therefore test the hypothesis that the input signal captures signatures of observed actions.

To segment and classify the spike trains we construct histograms of input signal intensities in a window around each frame, capturing the local statistics. Each histogram is quantized into 11 bins, equally spaced in a range from -1 to 1. To encode information from the inputs to all of the body's systems, the histograms for each limb and the torso are stacked to create a frame descriptor. We then use the lossy coding approach of [14] to produce an unsupervised segmentation. To encourage temporal coherence, we initially restrict merging to only neighboring segments, using a low distortion parameter $p$ ($p = 25$). This also significantly reduces the computational cost. After convergence of the first merging the neighbor restriction is removed in order to cluster repeating actions. For this final clustering we look for the two most stable segmentations across a range of $p$ values ($p \in [30, 130]$) and, of those, select the segmentation with the minimum $p$. Given the input identified by our algorithm, this full segmentation procedure takes approximately 1 minute to run on a 5000 frame sequence. Our results are shown in Fig. 2.
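The frame descriptor described above can be sketched as follows. The window size is not stated in the text, so `half_window` is a hypothetical parameter, as is the choice to normalize each histogram by the window length; the lossy-coding clustering of [14] is not shown.

```python
import numpy as np

def frame_descriptors(inputs, half_window=30, n_bins=11):
    """Stack per-system windowed histograms into one descriptor per frame.

    inputs: list of 1-D arrays (one inferred input per body system),
            all of length N with values in [-1, 1].
    Returns an array of shape (N, n_bins * len(inputs)).
    """
    N = len(inputs[0])
    edges = np.linspace(-1.0, 1.0, n_bins + 1)       # 11 equal bins on [-1, 1]
    desc = np.zeros((N, n_bins * len(inputs)))
    for k, u in enumerate(inputs):
        for t in range(N):
            lo, hi = max(0, t - half_window), min(N, t + half_window + 1)
            counts, _ = np.histogram(u[lo:hi], bins=edges)
            desc[t, k * n_bins:(k + 1) * n_bins] = counts / (hi - lo)
    return desc
```

With an odd number of bins, exact zeros fall in the central bin, so a mostly silent input yields a descriptor dominated by that bin, while spikes of either sign populate the outer bins.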

Fig. 2 also provides a comparison of our input-based segmentation results with existing algorithms for segmenting temporal data that operate directly on observed values. We compare with Barbič et al. [2], who proposed a change detection algorithm based on the reprojection error on the principal components computed in a sequential fashion. Their algorithm detects changes in motions relatively accurately; however, it does not cluster the distinct actions. The method of Vidal [27] models the first two principal components of the data as a first order Switched Autoregressive Exogenous Model and identifies the model's coefficients recursively. A change in the model's coefficients implies a change of human motion. Similarly, Ozay et al. [17] model the first three principal components of the data as a third order switched autoregressive model with piecewise constant coefficients. The coefficients are then clustered with $K$-means (with $K$ manually selected to the optimal number for each sequence). The segmentation results of both models are shown in Fig. 2 for comparison.

As a quantitative measure of the performance of our unsupervised clustering, we compared the areas of the regions segmented with the areas of the ground truth segmentation provided by [2]. Since transitions between actions are typically smooth (thus there are no "true" transition instants), labels assigned to regions ground truthed as transitions are not counted towards or against the classification score. Labels assigned by our algorithm were considered correct when they matched across repeated actions within a sequence and when they were unique for actions appearing only once. With this metric within-class oversegmentation does not count against us as long as it is consistent when that action class is repeated (this is observed in the "Rotate Body" actions in sequence 7). Averaging over the 6 sequences, we obtained a mean classification rate of 90.94% according to the above metric. For the SAR method the
Actions: W: walk, JU: jump up, P: punch, K: kick, SQ: squat, ST: stand, J1: jump on 1 leg, JF: jump forward, JA: jack, RB: rotate body, RA: rotate arms, R: run
Algorithms: STS: Spike Train Segmentation (our method), PCA: [ ], HUMAN: ground truth segmentation, RSARX: [ 7], SAR: [17]

Figure 2. [Best viewed in color] Here we show the performance of our temporal segmentation/action recognition on complex CMU MoCap data (subject 86, sequences 1, 2, 3, 5, 6, 7; shown left to right, top to bottom). Colors correspond to label values, thus regions marked with the same color are those that have been clustered as the same action. In sequence 2 we notice that transition regions are often identified as unique. This is due to the fact that transition regions are often smooth and do not exhibit regular statistics like their neighboring actions. In sequence 3 we notice that "walk" and "run" are confused. Likewise in sequence 5 there is confusion between "jump up", "jump forward", and "jump on one leg". In sequence 6 we oversegment "run" into 2 different labels; however, neither of these is confused with any other action in the sequence. Finally, the most interesting and exciting result is sequence 7. Here we have oversegmented the "rotate body" actions, but have broken them down into very regular components. We see that each instance of "rotate body" is composed of a starting transition (brown), two alternating short actions (light blue, dark blue), and an ending transition (pink).


mean  classification  rate  was  72.27%.  Our  result  illustrates 
that  the  distinct  complex  patterns  of  the  observed  data  were 
accurately  captured  as  patterns  of  the  sparse  input  signals. 

6.3.  Supervised  Classification 

Without any supervision we deconvolve the 158 samples of the FutureLight dataset. We then classify our observations using only statistics of the input signals defining the relative pose of the actor (the actor's global position and orientation are neglected). We extract features for each sparse signal with a sliding window of length 50 and step size of 16. Within each window the features we extract include the percentage of zero elements, the percentage of successive non-zero elements that maintain the same sign, as well as the percentage of successive non-zero elements that change sign.
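A possible reading of these window features is sketched below. Two points are assumptions rather than statements from the text: "successive non-zero elements" is taken to mean consecutive entries of the window after the zeros are removed, and the sign percentages are normalized by the number of such consecutive pairs. The function name is hypothetical and values are returned as fractions in [0, 1].

```python
import numpy as np

def window_features(u, length=50, step=16):
    """Per-window features: [zero fraction, same-sign fraction, sign-change fraction]."""
    feats = []
    for start in range(0, len(u) - length + 1, step):
        win = u[start:start + length]
        nz = win[win != 0]                     # non-zero samples, in order
        frac_zero = 1.0 - len(nz) / length
        pairs = max(len(nz) - 1, 0)            # successive non-zero pairs
        if pairs:
            same = np.sum(np.sign(nz[1:]) == np.sign(nz[:-1])) / pairs
            change = 1.0 - same
        else:
            same, change = 0.0, 0.0
        feats.append([frac_zero, same, change])
    return np.array(feats)
```

For a spike train of length 82 this yields three windows, starting at frames 0, 16, and 32.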

A dictionary is created from the extracted features with $K$-means clustering ($K = 12$), and each extracted window is projected onto the dictionary. At this stage, each of the 5 input signals representing an actor's body is translated into a sequence of labels. To take into account the temporal alignment of labels but also utilize support vector machines (SVM) with RBF kernel, we use the Smith-Waterman based technique described in [21] for classification. Classification
performance  is  evaluated  with  a  leave-one  out  cross  valida¬ 


Dance 

Jump 

Sit 

Run 

Walk 

Dance 

24 

2 

2 

3 

Jump 

2 

11 

1 

Sit 

1 

34 

Run 

3 

3 

23 

1 

Walk 

5 

43 

Table 3. Confusion matrix for the FutureLight dataset. Overall mean performance 83.87%.

tion  approach,  as  has  been  used  throughout  the  literature. 
We illustrate our classification results in Table 3, and compare with other methods in Table 4.

These results confirm that basic features of the sparse input signals capture characteristics of the observed time series. In this scenario, patterns were found to be characteristic of action classes despite inference of model parameters taking place independently for each action example. We achieve reasonably good performance on this task, but do not quite match discriminative approaches that construct dictionaries directly on the multi-dimensional observation signals. However, we make the point that our model generalizes to other tasks such as segmentation and synthesis, all of which it performs in a satisfactory fashion, while the other methods lack these capabilities. We also suspect that performance on supervised classification could be further boosted by incorporating prior knowledge of the action classes into the deconvolution procedure; however, we leave this investigation for future work.

Method                        FutureLight
[21]                          98.03
[22]                          89.7
Spike Train Classification    83.63 ± 1.23

Table 4. Comparison of classification results.

7.  Conclusion 

We have proposed a new and efficient alternating minimization algorithm for blind identification of linear dynamical systems driven by sparse inputs. By applying our model to a wide range of publicly available motion capture data, we have shown that this new class of models is powerful enough to capture non-stationarities of human motions. Finally, through both supervised and unsupervised segmentation and classification experiments we have demonstrated that our model is able to capture characteristic signatures of the observation in the inferred inputs. This makes it useful for analyzing sequences of various actions and applications where temporal ordering and representational accuracy are important.

Although we use motion-capture data to evaluate our dynamical models, the ultimate goal is to use these models to infer and classify time sequences of video, both at the low level (detection and tracking) and at the high level (action recognition).